arrow-row 52.2.0

Arrow row format
Documentation
A comparable row-oriented representation of a collection of [`Array`]. [`Row`]s are [normalized for sorting], and can therefore be very efficiently [compared], using [`memcmp`] under the hood, or used in [non-comparison sorts] such as [radix sort]. This makes the row format ideal for implementing efficient multi-column sorting, grouping, aggregation, windowing and more, as described in more detail [in this blog post](https://arrow.apache.org/blog/2022/11/07/multi-column-sorts-in-arrow-rust-part-1/). For example, given three input [`Array`], [`RowConverter`] creates byte sequences that [compare] the same as when using [`lexsort`]. ```text ┌─────┐ ┌─────┐ ┌─────┐ │ │ │ │ │ │ ├─────┤ ┌ ┼─────┼ ─ ┼─────┼ ┐ ┏━━━━━━━━━━━━━┓ │ │ │ │ │ │ ─────────────▶┃ ┃ ├─────┤ └ ┼─────┼ ─ ┼─────┼ ┘ ┗━━━━━━━━━━━━━┛ │ │ │ │ │ │ └─────┘ └─────┘ └─────┘ ... ┌─────┐ ┌ ┬─────┬ ─ ┬─────┬ ┐ ┏━━━━━━━━┓ │ │ │ │ │ │ ─────────────▶┃ ┃ └─────┘ └ ┴─────┴ ─ ┴─────┴ ┘ ┗━━━━━━━━┛ UInt64 Utf8 F64 Input Arrays Row Format (Columns) ``` _[`Rows`] must be generated by the same [`RowConverter`] for the comparison to be meaningful._ # Basic Example ``` # use std::sync::Arc; # use arrow_row::{RowConverter, SortField}; # use arrow_array::{ArrayRef, Int32Array, StringArray}; # use arrow_array::cast::{AsArray, as_string_array}; # use arrow_array::types::Int32Type; # use arrow_schema::DataType; let a1 = Arc::new(Int32Array::from_iter_values([-1, -1, 0, 3, 3])) as ArrayRef; let a2 = Arc::new(StringArray::from_iter_values(["a", "b", "c", "d", "d"])) as ArrayRef; let arrays = vec![a1, a2]; // Convert arrays to rows let converter = RowConverter::new(vec![ SortField::new(DataType::Int32), SortField::new(DataType::Utf8), ]).unwrap(); let rows = converter.convert_columns(&arrays).unwrap(); // Compare rows for i in 0..4 { assert!(rows.row(i) <= rows.row(i + 1)); } assert_eq!(rows.row(3), rows.row(4)); // Convert rows back to arrays let converted = converter.convert_rows(&rows).unwrap(); assert_eq!(arrays, converted); // Compare rows from different arrays let a1 = Arc::new(Int32Array::from_iter_values([3, 4])) as ArrayRef; let a2 = Arc::new(StringArray::from_iter_values(["e", "f"])) as ArrayRef; let arrays = vec![a1, a2]; let rows2 = converter.convert_columns(&arrays).unwrap(); assert!(rows.row(4) < rows2.row(0)); assert!(rows.row(4) < rows2.row(1)); // Convert selection of rows back to arrays let selection = [rows.row(0), rows2.row(1), rows.row(2), rows2.row(0)]; let converted = converter.convert_rows(selection).unwrap(); let c1 = converted[0].as_primitive::(); assert_eq!(c1.values(), &[-1, 4, 0, 3]); let c2 = converted[1].as_string::(); let c2_values: Vec<_> = c2.iter().flatten().collect(); assert_eq!(&c2_values, &["a", "f", "c", "e"]); ``` # Lexsort The row format can also be used to implement a fast multi-column / lexicographic sort ``` # use arrow_row::{RowConverter, SortField}; # use arrow_array::{ArrayRef, UInt32Array}; fn lexsort_to_indices(arrays: &[ArrayRef]) -> UInt32Array { let fields = arrays .iter() .map(|a| SortField::new(a.data_type().clone())) .collect(); let converter = RowConverter::new(fields).unwrap(); let rows = converter.convert_columns(arrays).unwrap(); let mut sort: Vec<_> = rows.iter().enumerate().collect(); sort.sort_unstable_by(|(_, a), (_, b)| a.cmp(b)); UInt32Array::from_iter_values(sort.iter().map(|(i, _)| *i as u32)) } ``` [non-comparison sorts]: https://en.wikipedia.org/wiki/Sorting_algorithm#Non-comparison_sorts [radix sort]: https://en.wikipedia.org/wiki/Radix_sort [normalized for sorting]: http://wwwlgis.informatik.uni-kl.de/archiv/wwwdvs.informatik.uni-kl.de/courses/DBSREAL/SS2005/Vorlesungsunterlagen/Implementing_Sorting.pdf [`memcmp`]: https://www.man7.org/linux/man-pages/man3/memcmp.3.html [`lexsort`]: https://docs.rs/arrow-ord/latest/arrow_ord/sort/fn.lexsort.html [compared]: PartialOrd [compare]: PartialOrd